Data Preprocessing of the Crowdfunding Data Set¶

Introduction:¶

Background¶

You work as a Data Scientist at a P-2-P platform that was founded a year ago. Now you want to expand your business. Your team has split up and each analyst has a subset of the data. Your job is to find out insights for your platform in an exploratory data analysis.

Your business model is to run a platform (crowd-investing) where people who have a business idea, but not the money needed, can sign up and raise money for their project within a given time. On the other hand, you have funders who would like to invest their money in projects and are looking for investments. As an intermediary, your platform brings borrowers and lenders together. You earn your money with a commission for each project that lands on your platform.

Your database is the history of your platform. All projects are completed projects, i.e. the time to raise money for your project has expired. Your business model is to pay out the collected money even if the target amount is not reached.

Data set overview¶

The original data set has 671205 data points and 13 features, thus covering a memory of 450.775664 mb.

The split record contains the following columns (incl. meaning):

  • Funded: amount of money received/disbursed at the end of the crowding period in USD
  • Requested: target amount (aimed amount to reach for the project) in USD
  • Activity: thematic subcategory of the crowd project
  • Sector: upper category of topics to which the crowd projects belong
  • Use: short description of what the money will be used for
  • Country code: country code according to ISO standard
  • Country: country name according to ISO standard
  • Region: region
  • Currency: Currency in which the Funded was then paid out
  • Term in months: duration over which the loan is to be paid
  • Lenders: lender (how many people invested in projects)
  • borrower_genders: gender and number of borrowers, i.e. those who initiated the crowd project
  • repayment interval: contractually agreed repayment modalities/rhythm

Data read and amalgamation¶

  • To Do:
    • import the libraries
In [ ]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
In [ ]:
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

Data read¶

  • To Do:
    • read only the first 2 rows:
      • identify the separator
      • detect the placement of the headers
      • check for presence of indexes
      • check for unique identifiers to be used for table merging
      • check for additional unrequired columns, e.g., columns with sums for each data point
      • check whether additional parameters are to be added
In [ ]:
data_1=pd.read_csv('input/your_part1.csv', 
                    sep='/n',
                    engine='python', 
                    nrows=2)           
data_1
Out[ ]:
, funded_amount, loan_amount, activity, sector, use, country_code, country, region, currency, term_in_months, lender_count, borrower_genders, repayment_interval
0 0,300.0,300.0,Fruits & Vegetables,Food,"To buy...
1 1,575.0,575.0,Rickshaw,Transportation,to repai...
  • Interpretation:
    • The separator is the comma
    • The table starts with headings
    • The data points are indexed
    • There are no unique identifiers
  • To Do:
    • Import the entire data set.
    • Display the last 5 rows:
      • Check for additional rows that are not needed, such as rows that display totals for each column.
    • Checking for correctness of data presentation.
In [ ]:
data_1=pd.read_csv('input/your_part1.csv', 
                    sep=',',
                    index_col=0)
data_1.tail(5)
Out[ ]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
334995 7125.0 7125.0 Plastics Sales Retail to stock up with a dozen plastic tubs and buck... CD The Democratic Republic of the Congo Goma, North Kivu province USD 6.0 15 female, female, female, female, female, female... irregular
334996 125.0 125.0 Home Appliances Personal Use to buy a water filter to provide safe drinking... KH Cambodia Phnom Penh KHR 8.0 5 male, male, male monthly
334997 925.0 925.0 Cattle Agriculture to increase his herd to get more milk and prov... TJ Tajikistan Khuroson TJS 14.0 28 male monthly
334998 175.0 175.0 General Store Retail to buy additional items to sell, like beverage... PH Philippines Palo, Leyte PHP 8.0 6 female irregular
334999 150.0 150.0 Food Production/Sales Food to buy sticky rice and sugar. PH Philippines Jagna, Bohol PHP 8.0 5 female irregular
  • Interpretation:
    • Such columns as: Funded, Requested and Term in months are of type float instead of integer.
    • The columns gender of borrower and region contain multiple values in one cell.
  • Note:
    • Change columns from float data type to integer type to save memory, make tables and plots clearer, and be more efficient with them when scripting.
    • Investigate whether the values in the Region column represent two separate regions or are an addition to the main region.
  • To Do:
    • read only the first 2 lines:
      • recognize the separator
      • recognize the placement of the headers
      • check if indexes are present
      • check for additional columns that are not needed
      • check if additional parameters should be added
In [ ]:
data_2=pd.read_csv('input/your_part2.csv', 
                    sep='/n',
                    engine='python', 
                    nrows=2)           
data_2
Out[ ]:
# funded_amount# loan_amount# activity# sector# use# country_code# country# region# currency# term_in_months# lender_count# borrower_genders# repayment_interval
0 0#175.0#175.0#Liquor Store / Off-License#Food#...
1 1#325.0#325.0#Livestock#Agriculture#to buy 3 z...
  • Interpretation:
    • The separator is #
    • The indexing starts with 0
    • The first row is als the header
  • To Do:
    • import the whole data set
    • display the last 5 rows:
      • check for additional unneeded rows
    • check for the corectness of data displayment
In [ ]:
data_2 = pd.read_csv('input/your_part2.csv', 
                        sep='#',
                        index_col=0)
data_2.tail(5)
Out[ ]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
336200 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 female monthly
336201 25.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 1 female monthly
336202 0.0 25.0 Games Entertainment NaN KE Kenya NaN KES 13.0 0 NaN monthly
336203 0.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 0 female monthly
336204 0.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 0 female monthly
  • Interpretation:
    • It is to assume, that some of the values in the 'use' column:
      • are placed in [ ]
      • are NaNs
      • are written in English, Italian
      • contain booleans
      • and the language might be sometimes informal, e.g., the use of 'u' instead of 'you'
    • The Borrower genders and Region columns comprise NaNs
    • From the first quick overview over the columns' names, these seem to be the same as in the first data set and the indexing starts with 0. Presumably, the concat() function woulb be an option to join the both data sets
  • Note:
    • Look for NaNs in the Use, Region and Borrower genders columns
    • Consider the above interpreation to decide on the way of dealing with the NaNs in the 'use' column
    • Could the concat() function be the otpional for combining the two datas sets

Concatination of the two data sets¶

Aim: Prove wheather the concat() function is the right method to join the two data set

  • To Do:
    • check for the columns similarity
    • check the indexing
    • check the shapes
In [ ]:
data_1.columns == data_2.columns
Out[ ]:
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True])
In [ ]:
data_1.index
Out[ ]:
Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9,
            ...
            334990, 334991, 334992, 334993, 334994, 334995, 334996, 334997,
            334998, 334999],
           dtype='int64', length=335000)
In [ ]:
data_2.index
Out[ ]:
Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9,
            ...
            336195, 336196, 336197, 336198, 336199, 336200, 336201, 336202,
            336203, 336204],
           dtype='int64', length=336205)
  • Interpretation:
    • Since the both data set are similar in columns' names and the indexing starts with 0, the most efficient method that would avoid errors is the concat() function
  • To Do:
    • concatinate the both data sets
    • check wheather the concatination was successful
    • handle the indexing, since the data set would have an amount of similar indexes
In [ ]:
data_concat = pd.concat([data_1, data_2])
data_concat
Out[ ]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 female irregular
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 female, female irregular
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 female bullet
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 female irregular
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 female monthly
... ... ... ... ... ... ... ... ... ... ... ... ... ...
336200 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 female monthly
336201 25.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 1 female monthly
336202 0.0 25.0 Games Entertainment NaN KE Kenya NaN KES 13.0 0 NaN monthly
336203 0.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 0 female monthly
336204 0.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 0 female monthly

671205 rows × 13 columns

In [ ]:
# check for correctness
len(data_1)+len(data_2) == len(data_concat)
Out[ ]:
True
In [ ]:
# check for correctness
data_concat.loc[0,:]
Out[ ]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 female irregular
0 175.0 175.0 Liquor Store / Off-License Food to purchase additional stock of coconut wine t... PH Philippines Palo, Leyte PHP 8.0 6 female irregular
In [ ]:
data_concat.reset_index(inplace=True)  # reset the indexes
data_concat
Out[ ]:
index funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
0 0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 female irregular
1 1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 female, female irregular
2 2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 female bullet
3 3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 female irregular
4 4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 female monthly
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
671200 336200 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 female monthly
671201 336201 25.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 1 female monthly
671202 336202 0.0 25.0 Games Entertainment NaN KE Kenya NaN KES 13.0 0 NaN monthly
671203 336203 0.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 0 female monthly
671204 336204 0.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 0 female monthly

671205 rows × 14 columns

In [ ]:
data = data_concat.drop(columns=['index'])  # remove the initial index column

Structure overview and duplicates¶

Initial structure overview¶

  • To Do:
    • check the current shape
    • check the columns' names, whether these have some anomalies
In [ ]:
data.shape
Out[ ]:
(671205, 13)
In [ ]:
data.columns
Out[ ]:
Index([' funded_amount', ' loan_amount', ' activity', ' sector', ' use',
       ' country_code', ' country', ' region', ' currency', ' term_in_months',
       ' lender_count', ' borrower_genders', ' repayment_interval'],
      dtype='object')
In [ ]:
# display the memory usage 
print(f'The data set has a memory of {data.memory_usage(deep=True).sum()*1e-6} mb.')
The data set has a memory of 450.775664 mb.
  • Interpretation:
    • The headers start with free space
    • These are lowercased
    • Headers consisting of several words are joined with _
  • To Do:
    • capitalize the headings
    • remove the _ . These approaches would reduce the time required to rename the x and y axes on numerous charts and ensure a finer and more aesthetically pleasing table visualization.
    • review the result
    • understand the contents of the data set
In [ ]:
# remove the free space
data.columns = data.columns.str.replace(' ', '')
In [ ]:
# check the results
data.columns
Out[ ]:
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'use',
       'country_code', 'country', 'region', 'currency', 'term_in_months',
       'lender_count', 'borrower_genders', 'repayment_interval'],
      dtype='object')
In [ ]:
data['country'].unique()
Out[ ]:
array(['Pakistan', 'India', 'Kenya', 'Nicaragua', 'El Salvador',
       'Tanzania', 'Philippines', 'Peru', 'Senegal', 'Cambodia',
       'Liberia', 'Vietnam', 'Iraq', 'Honduras', 'Palestine', 'Mongolia',
       'United States', 'Mali', 'Colombia', 'Tajikistan', 'Guatemala',
       'Ecuador', 'Bolivia', 'Yemen', 'Ghana', 'Sierra Leone', 'Haiti',
       'Chile', 'Jordan', 'Uganda', 'Burundi', 'Burkina Faso',
       'Timor-Leste', 'Indonesia', 'Georgia', 'Ukraine', 'Kosovo',
       'Albania', 'The Democratic Republic of the Congo', 'Costa Rica',
       'Somalia', 'Zimbabwe', 'Cameroon', 'Turkey', 'Azerbaijan',
       'Dominican Republic', 'Brazil', 'Mexico', 'Kyrgyzstan', 'Armenia',
       'Paraguay', 'Lebanon', 'Samoa', 'Israel', 'Rwanda', 'Zambia',
       'Nepal', 'Congo', 'Mozambique', 'South Africa', 'Togo', 'Benin',
       'Belize', 'Suriname', 'Thailand', 'Nigeria', 'Mauritania',
       'Vanuatu', 'Panama', 'Virgin Islands',
       'Saint Vincent and the Grenadines',
       "Lao People's Democratic Republic", 'Malawi', 'Myanmar (Burma)',
       'Moldova', 'South Sudan', 'Solomon Islands', 'China', 'Egypt',
       'Guam', 'Afghanistan', 'Madagascar', 'Namibia', 'Puerto Rico',
       'Lesotho', "Cote D'Ivoire", 'Bhutan'], dtype=object)
In [ ]:
data.loc[data['country'] == "Lao People's Democratic Republic", 'country'] = 'Lao People\'s Democratic Republic'
In [ ]:
data.dtypes
Out[ ]:
funded_amount         float64
loan_amount           float64
activity               object
sector                 object
use                    object
country_code           object
country                object
region                 object
currency               object
term_in_months        float64
lender_count            int64
borrower_genders       object
repayment_interval     object
dtype: object
In [ ]:
# rename the headers
data.columns = ['Funded', 'Requested', 'Activity', 'Sector', 'Use', 'Country code',
               'Country', 'Region', 'Currency', 'Term in months', 'Lenders',
               'Borrower genders', 'Repayment interval']

Duplicates¶

Inspection of duplicates¶

Since there are no entries that could be considered as unique values and that could be of help to identify duplicted data points, therefore, as duplicates are going to be interpreted those rows whose contents fully mirror one another.

  • To DO:
    • check for the duplicates
In [ ]:
print(f'{data.duplicated().sum()} data points may be disposed as if they are duplicates.')
24372 data points may be disposed of if they are duplicates.

Managing the duplicates¶

  • To Do:

    • Duplicates complicate the analysis of platform activities as they increase data redundancy and distort the data. As well, the gained insights may not be trustworthy and may, for example, lead to financial decisions that are not traceable. So, the duplicates are going to be cleaned off.

    • Check the correctness of the action

In [ ]:
data.drop_duplicates(inplace=True)           
data.reset_index(drop=True)
Out[ ]:
Funded Requested Activity Sector Use Country code Country Region Currency Term in months Lenders Borrower genders Repayment interval
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 female irregular
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 female, female irregular
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 female bullet
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 female irregular
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 female monthly
... ... ... ... ... ... ... ... ... ... ... ... ... ...
646828 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 female monthly
646829 0.0 25.0 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 female monthly
646830 0.0 125.0 Livestock Agriculture Pretend the flagged issue was addressed by KC. MX Mexico Iztacalco MXN 13.0 0 female, female monthly
646831 0.0 875.0 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 female, female monthly
646832 0.0 250.0 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13.0 0 female monthly

646833 rows × 13 columns

In [ ]:
# check the result
print(f'The data set contains {data.duplicated().sum()} duplicates.')
The data set contains 0 duplicates.

Missing values¶

  • Note:
    • search for:
      • Synonyms: ?, dummy, dummy-must-be-change,unknown, missing_type, missing_values, missing type, -1, -, -99999
      • Conventions: NaN
    • decide on the approach:
      • delete the whole column or only the rows with the missing values
      • replace with a proper value (e.g., 0, min, max, mean, median)
      • leave it like this

Initial inspection¶

  • To Do:
    • analyse the structue of the data set to find hints that could direct over the type of the missing value and the columns that contain them
    • identify the missing values
In [ ]:
df = data.copy()     # save a copy of the data set
In [ ]:
data.shape == df.shape    # check for the shape to assure the correctness of the saved copy
Out[ ]:
True
  • To Do:
    • examine the amount of non-null data points in relation to the total amount of data points
    • explore the data types; object data type could suggest the presence of synonyms
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 646833 entries, 0 to 671188
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Funded              646833 non-null  float64
 1   Requested           646833 non-null  float64
 2   Activity            646833 non-null  object 
 3   Sector              646833 non-null  object 
 4   Use                 642934 non-null  object 
 5   Country code        646825 non-null  object 
 6   Country             646833 non-null  object 
 7   Region              590670 non-null  object 
 8   Currency            646833 non-null  object 
 9   Term in months      646833 non-null  float64
 10  Lenders             646833 non-null  int64  
 11  Borrower genders    642945 non-null  object 
 12  Repayment interval  646833 non-null  object 
dtypes: float64(3), int64(1), object(9)
memory usage: 69.1+ MB
  • Interpretation:

    • The presence of NaNs in the following list does not exclude the columns from beeing investigated for synonyms.

    • The data points of the following columns is less than the total data points, so the following 4 columns include NaNs:

      • Use
      • Country code
      • Region
      • Borrower genders
    • The following remaining columns of object data type are still questioned on the consistency of missing values:

      • Activity
      • Sector
      • Country
      • Currency
      • Borrower genders
      • Repayment interval
    • All numeric columns are of float or integer type, thus these do not contain synonyms.

  • To Do:
    • the nunique() could give hints over missing values
In [ ]:
df.nunique()
Out[ ]:
Funded                   610
Requested                479
Activity                 163
Sector                    15
Use                   424912
Country code              86
Country                   87
Region                 12695
Currency                  67
Term in months           148
Lenders                  503
Borrower genders       11298
Repayment interval         4
dtype: int64
  • Interpretation:
    • Since the Country code's unique values are less with 1 factor to Country unique values, it could suggest that there is a group that has not been encoded.
  • To Do:
    • The next step is to see some statistics. Indifferent of the data type, the number of data points will be displayed (count) for each column. Additionally, 3 more characteristics for the object data types (only strings) will be shown: unique, top, freq. The statistics for object data types is followed by the statistics for columns of float or integer type. In case when data contains conventions or synonyms, the statistics will be represented as NaNs or displayed in the targets category aimed for a different data type. The infinities could also be identified.
In [ ]:
df.describe(include='all')
Out[ ]:
Funded Requested Activity Sector Use Country code Country Region Currency Term in months Lenders Borrower genders Repayment interval
count 646833.000000 646833.000000 646833 646833 642934 646825 646833 590670 646833 646833.000000 646833.000000 642945 646833
unique NaN NaN 163 15 424912 86 87 12695 67 NaN NaN 11298 4
top NaN NaN Farming Agriculture to buy a water filter to provide safe drinking... PH Philippines Lahore PHP NaN NaN female monthly
freq NaN NaN 69497 174624 2074 154323 154323 7168 154322 NaN NaN 415237 330448
mean 807.845170 866.239238 NaN NaN NaN NaN NaN NaN NaN 13.718920 21.110709 NaN NaN
std 1145.283451 1214.135953 NaN NaN NaN NaN NaN NaN NaN 8.479172 28.837152 NaN NaN
min 0.000000 25.000000 NaN NaN NaN NaN NaN NaN NaN 1.000000 0.000000 NaN NaN
25% 275.000000 275.000000 NaN NaN NaN NaN NaN NaN NaN 8.000000 7.000000 NaN NaN
50% 475.000000 500.000000 NaN NaN NaN NaN NaN NaN NaN 13.000000 13.000000 NaN NaN
75% 925.000000 1000.000000 NaN NaN NaN NaN NaN NaN NaN 14.000000 25.000000 NaN NaN
max 100000.000000 100000.000000 NaN NaN NaN NaN NaN NaN NaN 158.000000 2986.000000 NaN NaN
  • Interpretation:
    • The describe() function revealed the all 4 previous mentioned containing NaNs columns.
    • There are no synonyms.
  • Note:
    • The statistics of the numeric columns suggest not normally distributed data, since The 75% of the data is much less in comparison to the maximum and is close to the mean and has lower minmum values.

Inspect the NaNs¶

In [ ]:
nan_sum = df.isna().sum()            # List of columns with NaNs and contained number of NaNs
nan_sum[nan_sum > 0].sort_values(ascending=False)
Out[ ]:
Region              56163
Use                  3899
Borrower genders     3888
Country code            8
dtype: int64
In [ ]:
nan_stat = df.isna().sum()           # Distribution of NaNs in each column
np.round(nan_stat[nan_stat > 0]/df.shape[0],5).sort_values(ascending=False)
Out[ ]:
Region              0.08683
Use                 0.00603
Borrower genders    0.00601
Country code        0.00001
dtype: float64

Synonyms and infinities¶

  • To Do:
    • Check whether the previous assumptions of absence of synonyms is true.
In [ ]:
filter_data = data[data.isin(['missing_type', '?', 'unknown', 'missing type', 'dummy', 'missing_values', 'missing values', np.inf, -np.inf]).any(axis='columns')]
filter_data
Out[ ]:
Funded Requested Activity Sector Use Country code Country Region Currency Term in months Lenders Borrower genders Repayment interval

Directing NaNs¶

  • To Do:
    • decide which approaches to take in regard to the following columns:
      • Use
      • Country code
      • region
      • Borrower genders
    • apply the approaches

'Country code'¶

  • Interpretation:
    • The Country code column represent abbreviations for countries of the crowd project. The countries are recorded in the Country column. One possible approach would be to examine the data in both columns and decide whether it would be reasonable to replace the NaNs with the appropriate abbreviations from the ISO database.
In [ ]:
# extract the data points containing the NaNs
data.loc[data['Country code'].isnull()]
Out[ ]:
Funded Requested Activity Sector Use Country code Country Region Currency Term in months Lenders Borrower genders Repayment interval
202537 4150.0 4150.0 Wholesale Wholesale To purchase lighting products for sale to loca... NaN Namibia EEnhana NAD 6.0 162 female bullet
202823 4150.0 4150.0 Wholesale Wholesale To purchase lighting products for sale to loca... NaN Namibia Rundu NAD 6.0 159 male bullet
344929 3325.0 3325.0 Wholesale Wholesale To purchase lighting products for sale to loca... NaN Namibia EEnhana NAD 7.0 120 female bullet
351177 3325.0 3325.0 Wholesale Wholesale To purchase lighting products for sale to loca... NaN Namibia Rundu NAD 7.0 126 male bullet
420953 3325.0 3325.0 Wholesale Wholesale To purchase lighting products for sale to loca... NaN Namibia EEnhana NAD 7.0 118 female bullet
421218 4000.0 4000.0 Wholesale Wholesale purchase solar lighting products for sale to l... NaN Namibia Rundu NAD 7.0 150 male bullet
487207 5100.0 5100.0 Renewable Energy Products Retail to pay for stock of solar lights and cell phon... NaN Namibia Katima Mulilo NAD 7.0 183 male bullet
487653 5000.0 5000.0 Wholesale Wholesale to maintain a stock of solar lights and cell p... NaN Namibia Oshakati NAD 7.0 183 female bullet
In [ ]:
# check wheather ISO abbreviations have not been recorded only for Namibia
(data['Country'] == 'Namibia').sum() == data['Country code'].isnull().sum()
Out[ ]:
True
In [ ]:
# Replace the NaNS with the corresponding abbreviation
data.loc[data['Country code'].isnull(), 'Country code'] = 'NA'
In [ ]:
# 1. check
data.loc[data['Country code'].isnull()]
Out[ ]:
Funded Requested Activity Sector Use Country code Country Region Currency Term in months Lenders Borrower genders Repayment interval
In [ ]:
# 2. check
data.loc[data['Country'] == 'Namibia']
Out[ ]:
Funded Requested Activity Sector Use Country code Country Region Currency Term in months Lenders Borrower genders Repayment interval
202537 4150.0 4150.0 Wholesale Wholesale To purchase lighting products for sale to loca... NA Namibia EEnhana NAD 6.0 162 female bullet
202823 4150.0 4150.0 Wholesale Wholesale To purchase lighting products for sale to loca... NA Namibia Rundu NAD 6.0 159 male bullet
344929 3325.0 3325.0 Wholesale Wholesale To purchase lighting products for sale to loca... NA Namibia EEnhana NAD 7.0 120 female bullet
351177 3325.0 3325.0 Wholesale Wholesale To purchase lighting products for sale to loca... NA Namibia Rundu NAD 7.0 126 male bullet
420953 3325.0 3325.0 Wholesale Wholesale To purchase lighting products for sale to loca... NA Namibia EEnhana NAD 7.0 118 female bullet
421218 4000.0 4000.0 Wholesale Wholesale purchase solar lighting products for sale to l... NA Namibia Rundu NAD 7.0 150 male bullet
487207 5100.0 5100.0 Renewable Energy Products Retail to pay for stock of solar lights and cell phon... NA Namibia Katima Mulilo NAD 7.0 183 male bullet
487653 5000.0 5000.0 Wholesale Wholesale to maintain a stock of solar lights and cell p... NA Namibia Oshakati NAD 7.0 183 female bullet

'Region'¶

In [ ]:
data['Region'].value_counts().head(50)
Out[ ]:
Lahore                           7168
Kaduna                           5573
Rawalpindi                       4344
Cusco                            3838
Dar es Salaam                    3719
Kisii                            3468
Narra, Palawan                   3155
Medellín                         2999
Palo, Leyte                      2940
Quezon, Palawan                  2929
San Miguel                       2850
Brookes Point, Palawan           2753
Tacloban City, Leyte             2736
San Gabriel                      2734
Kabankalan, Negros Occidental    2728
Kitale                           2689
Thanh Hoá                        2647
Eldoret                          2597
Cordova, Cebu                    2594
Webuye                           2513
Multan                           2473
Bais, Negros Oriental            2444
Roxas City, Capiz                2436
Managua                          2403
Solola                           2383
Banga, Aklan                     2315
Kisumu                           2308
Gotera                           2300
Nakuru                           2287
Hinigaran, Negros Occidental     2282
Usulután                         2230
Likoni                           2219
Isabela, Negros Occidental       2209
Dumaguete, Negros Oriental       2202
Ciudad El Triunfo                2176
San Lorenzo                      2163
Kampala                          2111
Calbayog City, Samar             2052
La Paz                           2042
Tanjay, Negros Oriental          2020
Tulcán                           1996
Kasese                           1985
Calatrava, Negros Occidental     1943
Fort Portal                      1941
Portoviejo                       1924
Himamaylan, Negros Occidental    1922
Minglanilla, Cebu                1907
Barranquilla                     1906
Tangerang                        1905
El Transito                      1842
Name: Region, dtype: int64
  • Interpretation:

    • This column contains mainly regions, but also municipalities or other smaller administrative units are represented.

    • The names are given mainly in English, but also in the local language.

    • There are also abbreviated words, numbers or information about the type of administrative units.

    • In some cases there are no spaces between punctuation marks and values.

    • Since the entries in the region column do not follow a strict convention and it is difficult to divide the values into different types of departments, it is not possible to substitute them.

In [ ]:
# look for 
data.loc[:,'Region'].value_counts(dropna=False)
Out[ ]:
NaN                           56163
Lahore                         7168
Kaduna                         5573
Rawalpindi                     4344
Cusco                          3838
                              ...  
Maplewood                         1
Ifakara                           1
Sioma                             1
Say village, Batken region        1
alejandria                        1
Name: Region, Length: 12696, dtype: int64
  • Interpretation:
    • The Region column can still be part of the data set, since there are regions which occur only once, otherwise some relevant categories could be out of consideration.

'Borrower genders'¶

  • Interpretation
    • The column consists of too many unique values, thus this could be part of the set, and a new column with the number of borrowers could be added to the set.

'Use'¶

In [ ]:
# see some samples containing missing values and analyse them in relation to other columns
data.loc[data.loc[:,'Use'].isnull()]
Out[ ]:
Funded Requested Activity Sector Use Country code Country Region Currency Term in months Lenders Borrower genders Repayment interval
140 2975.0 2975.0 Food Production/Sales Food NaN TZ Tanzania NaN TZS 10.0 110 NaN monthly
145 1200.0 1200.0 Personal Expenses Personal Use NaN PE Peru NaN PEN 20.0 44 NaN monthly
170 4250.0 4250.0 Catering Food NaN TZ Tanzania NaN TZS 10.0 116 NaN monthly
412 2350.0 2350.0 Beauty Salon Services NaN TZ Tanzania NaN TZS 10.0 75 NaN monthly
414 725.0 725.0 Agriculture Agriculture NaN SV El Salvador NaN USD 20.0 19 NaN monthly
... ... ... ... ... ... ... ... ... ... ... ... ... ...
659604 5625.0 10000.0 Weaving Arts NaN BT Bhutan NaN USD 14.0 210 NaN irregular
660788 1975.0 1975.0 Home Energy Personal Use NaN PS Palestine NaN USD 27.0 39 NaN monthly
661718 800.0 1600.0 Furniture Making Manufacturing NaN HT Haiti NaN HTG 13.0 27 NaN irregular
671151 0.0 25.0 Livestock Agriculture NaN KE Kenya NaN KES 13.0 0 NaN monthly
671174 0.0 25.0 Games Entertainment NaN KE Kenya NaN KES 13.0 0 NaN monthly

3899 rows × 13 columns

  • Interpretation:

    • Based on the mentions of this column in the introductory chapter, it could be argued that the values do not follow any recording rule, i.e., the documentation is rather deliberate and reminiscent of bags of comments. Moreover, there are 424912 unique valeus in this column. Therefore, it is difficult to replace the missing values with specific values (most common words) or alternatives.

    • Analyzing the Use column in agreement with the other columns, region and borrowe, genders, these also contain NaNs. In the 'Region' subsection, it was noted that there are countries that occur only once. By removing the rows with NaNs, important categories/values could be disregarded, making the findings incomplete.

    • These columns could help in the later exploratory analyses to gain insights or better understand the graphs, e.g., some word clouds could be created for a selection of interested countries or sectors.

    • Based on this analysis, it would be better to leave the missing values in this column and the column itself in its original state.

New features¶

'Credit types'¶

In [ ]:
data.dtypes
Out[ ]:
Funded                float64
Requested             float64
Activity               object
Sector                 object
Use                    object
Country code           object
Country                object
Region                 object
Currency               object
Term in months        float64
Lenders                 int64
Borrower genders       object
Repayment interval     object
dtype: object
In [ ]:
loan_describe = data['Requested'].describe()
loan_describe
Out[ ]:
count    646833.000000
mean        866.239238
std        1214.135953
min          25.000000
25%         275.000000
50%         500.000000
75%        1000.000000
max      100000.000000
Name: Requested, dtype: float64
  • Interpretation:
    • There is a big discrepancy between the 75% of the data points achieving an aimed amount for a project of 1000 and the maximum of 100 000 USD.
  • To Do:
    • To support the statistics and better understand the distribution of the aimed amounts, a scatterplot would be of help.
In [ ]:
fig = px.scatter(data_frame=data, x= 'Country', y='Requested', title='Overview over requested credits')

fig.update_layout(yaxis=dict(tickmode = 'linear', tick0 = 0.00, dtick = 25000.00))

fig.show()
In [ ]:
data.loc[data['Country'] == 'The Democratic Republic of the Congo', 'Country'] = 'Congo'
In [ ]:
data.loc[data['Country'] == 'The Democratic Republic of the Congo']
Out[ ]:
Funded Requested Activity Sector Use Country code Country Region Currency Term in months Lenders Borrower genders Repayment interval
In [ ]:
categories = ['bronze','silver', 'gold','platinum']
values = [loan_describe['min'], 15000.00, 35000.00, 50000.00, loan_describe['max']]

data['Credit types'] = pd.cut(data['Requested'], bins=4, labels=categories)
data
Out[ ]:
Funded Requested Activity Sector Use Country code Country Region Currency Term in months Lenders Borrower genders Repayment interval Credit types
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 female irregular bronze
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 female, female irregular bronze
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 female bullet bronze
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 female irregular bronze
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 female monthly bronze
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
671179 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 female monthly bronze
671181 0.0 25.0 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 female monthly bronze
671182 0.0 125.0 Livestock Agriculture Pretend the flagged issue was addressed by KC. MX Mexico Iztacalco MXN 13.0 0 female, female monthly bronze
671184 0.0 875.0 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 female, female monthly bronze
671188 0.0 250.0 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13.0 0 female monthly bronze

646833 rows × 14 columns

'Difference amount'¶

In [ ]:
data.insert(2, 'Difference amount', ((data['Requested'] - data['Funded']) / data['Requested'])*100)
In [ ]:
data
Out[ ]:
Funded Requested Difference amount Activity Sector Use Country code Country Region Currency Term in months Lenders Borrower genders Repayment interval Credit types
0 300.0 300.0 0.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 female irregular bronze
1 575.0 575.0 0.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 female, female irregular bronze
2 150.0 150.0 0.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 female bullet bronze
3 200.0 200.0 0.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 female irregular bronze
4 400.0 400.0 0.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 female monthly bronze
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
671179 0.0 25.0 100.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 female monthly bronze
671181 0.0 25.0 100.0 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 female monthly bronze
671182 0.0 125.0 100.0 Livestock Agriculture Pretend the flagged issue was addressed by KC. MX Mexico Iztacalco MXN 13.0 0 female, female monthly bronze
671184 0.0 875.0 100.0 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 female, female monthly bronze
671188 0.0 250.0 100.0 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13.0 0 female monthly bronze

646833 rows × 15 columns

'Covered amount'¶

In [ ]:
# make the measurement and save it into a new column
data.insert(3, 'Covered amount', (data['Funded']/data['Requested'])*100)
In [ ]:
# check
data
Out[ ]:
Funded Requested Difference amount Covered amount Activity Sector Use Country code Country Region Currency Term in months Lenders Borrower genders Repayment interval Credit types
0 300.0 300.0 0.0 100.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 female irregular bronze
1 575.0 575.0 0.0 100.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 female, female irregular bronze
2 150.0 150.0 0.0 100.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 female bullet bronze
3 200.0 200.0 0.0 100.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 female irregular bronze
4 400.0 400.0 0.0 100.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 female monthly bronze
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
671179 0.0 25.0 100.0 0.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 female monthly bronze
671181 0.0 25.0 100.0 0.0 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 female monthly bronze
671182 0.0 125.0 100.0 0.0 Livestock Agriculture Pretend the flagged issue was addressed by KC. MX Mexico Iztacalco MXN 13.0 0 female, female monthly bronze
671184 0.0 875.0 100.0 0.0 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 female, female monthly bronze
671188 0.0 250.0 100.0 0.0 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13.0 0 female monthly bronze

646833 rows × 16 columns

Population Density¶

To normalize the data, a new data set was used. The total population pro country was considered. https://population.un.org/wpp/Download/Standard/CSV/

In [ ]:
pop_density_all = pd.read_csv('input/WPP2019_TotalPopulationBySex.csv')
pop_density_all
Out[ ]:
LocID Location VarID Variant Time MidPeriod PopMale PopFemale PopTotal PopDensity
0 4 Afghanistan 2 Medium 1950 1950.5 4099.243 3652.874 7752.117 11.874
1 4 Afghanistan 2 Medium 1951 1951.5 4134.756 3705.395 7840.151 12.009
2 4 Afghanistan 2 Medium 1952 1952.5 4174.450 3761.546 7935.996 12.156
3 4 Afghanistan 2 Medium 1953 1953.5 4218.336 3821.348 8039.684 12.315
4 4 Afghanistan 2 Medium 1954 1954.5 4266.484 3884.832 8151.316 12.486
... ... ... ... ... ... ... ... ... ... ...
280927 716 Zimbabwe 207 Lower 95 PI 2080 2080.5 10576.533 11255.983 21836.893 56.448
280928 716 Zimbabwe 207 Lower 95 PI 2085 2085.5 10293.349 11050.875 21355.988 55.205
280929 716 Zimbabwe 207 Lower 95 PI 2090 2090.5 9920.336 10767.709 20689.956 53.483
280930 716 Zimbabwe 207 Lower 95 PI 2095 2095.5 9503.711 10412.184 19892.080 51.421
280931 716 Zimbabwe 207 Lower 95 PI 2100 2100.5 9090.075 9996.105 19061.177 49.273

280932 rows × 10 columns

In [ ]:
pop_density_all.dtypes
Out[ ]:
LocID           int64
Location       object
VarID           int64
Variant        object
Time            int64
MidPeriod     float64
PopMale       float64
PopFemale     float64
PopTotal      float64
PopDensity    float64
dtype: object
In [ ]:
pop_density = pop_density_all.loc[pop_density_all.loc[:,'Time'] == 2022, ['Location', 'Time', 'PopTotal']]
pop_density 
Out[ ]:
Location Time PopTotal
72 Afghanistan 2022 40754.385
153 Afghanistan 2022 40875.793
234 Afghanistan 2022 40669.912
315 Afghanistan 2022 41046.378
396 Afghanistan 2022 39962.090
... ... ... ...
280444 Zimbabwe 2022 15108.760
280525 Zimbabwe 2022 15491.853
280606 Zimbabwe 2022 15313.014
280687 Zimbabwe 2022 15365.451
280768 Zimbabwe 2022 15216.619

2757 rows × 3 columns

In [ ]:
pop_density_drop = pop_density.drop_duplicates(subset=['Location'], ignore_index=True)
pop_density_drop.reset_index()
pop_density_drop
Out[ ]:
Location Time PopTotal
0 Afghanistan 2022 40754.385
1 Africa 2022 1406728.760
2 African Group 2022 1404902.167
3 African Union 2022 1405528.335
4 African Union: Central Africa 2022 168061.108
... ... ... ...
469 World 2022 7953952.577
470 World Bank Regional Groups (developing only) 2022 6679106.813
471 Yemen 2022 31154.866
472 Zambia 2022 19470.238
473 Zimbabwe 2022 15331.426

474 rows × 3 columns

In [ ]:
pop_sel = pop_density_drop[['Location', 'PopTotal']]
pop_sel
Out[ ]:
Location PopTotal
0 Afghanistan 40754.385
1 Africa 1406728.760
2 African Group 1404902.167
3 African Union 1405528.335
4 African Union: Central Africa 168061.108
... ... ...
469 World 7953952.577
470 World Bank Regional Groups (developing only) 6679106.813
471 Yemen 31154.866
472 Zambia 19470.238
473 Zimbabwe 15331.426

474 rows × 2 columns

In [ ]:
pop_sel_2 = pop_sel.copy()
In [ ]:
pop_sel_2.rename(mapper={'Location':'Country'}, axis = 1, inplace = True)
pop_sel_2.columns
Out[ ]:
Index(['Country', 'PopTotal'], dtype='object')
In [ ]:
pop_sel_2.set_index('Country')
Out[ ]:
PopTotal
Country
Afghanistan 40754.385
Africa 1406728.760
African Group 1404902.167
African Union 1405528.335
African Union: Central Africa 168061.108
... ...
World 7953952.577
World Bank Regional Groups (developing only) 6679106.813
Yemen 31154.866
Zambia 19470.238
Zimbabwe 15331.426

474 rows × 1 columns

In [ ]:
data_population = pop_sel_2.to_pickle('output/population.pkl')
In [ ]:
data_pop = data.merge(pop_sel_2, on='Country', how='left')
data_pop
Out[ ]:
Funded Requested Difference amount Covered amount Activity Sector Use Country code Country Region Currency Term in months Lenders Borrower genders Repayment interval Credit types PopTotal
0 300.0 300.0 0.0 100.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 female irregular bronze 229488.996
1 575.0 575.0 0.0 100.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 female, female irregular bronze 229488.996
2 150.0 150.0 0.0 100.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 female bullet bronze 1406631.781
3 200.0 200.0 0.0 100.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 female irregular bronze 229488.996
4 400.0 400.0 0.0 100.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 female monthly bronze 229488.996
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
646828 0.0 25.0 100.0 0.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 female monthly bronze 7305.842
646829 0.0 25.0 100.0 0.0 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 female monthly bronze 229488.996
646830 0.0 125.0 100.0 0.0 Livestock Agriculture Pretend the flagged issue was addressed by KC. MX Mexico Iztacalco MXN 13.0 0 female, female monthly bronze 131562.775
646831 0.0 875.0 100.0 0.0 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 female, female monthly bronze NaN
646832 0.0 250.0 100.0 0.0 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13.0 0 female monthly bronze 32395.454

646833 rows × 17 columns

In [ ]:
nan_sum = data_pop.isna().sum()           
nan_sum[nan_sum > 0].sort_values(ascending=False)
Out[ ]:
Region              56163
PopTotal            41439
Use                  3899
Borrower genders     3888
dtype: int64
In [ ]:
data_pop['PopTotal'].nunique()
Out[ ]:
76
In [ ]:
countries = data_pop.loc[(data_pop['PopTotal'].isnull(), 'Country')].unique()
countries
Out[ ]:
array(['Tanzania', 'Vietnam', 'Palestine', 'United States', 'Bolivia',
       'Kosovo', 'Virgin Islands', 'Myanmar (Burma)', 'Moldova',
       "Cote D'Ivoire"], dtype=object)
In [ ]:
densities = [63298.542,  98953.535, 5345.545, 334805.268, 11992.647, 1873.592, 103.969, 55227.152, 4013.178, 2774.301]
combi = dict(zip(countries, densities))
combi
Out[ ]:
{'Tanzania': 63298.542,
 'Vietnam': 98953.535,
 'Palestine': 5345.545,
 'United States': 334805.268,
 'Bolivia': 11992.647,
 'Kosovo': 1873.592,
 'Virgin Islands': 103.969,
 'Myanmar (Burma)': 55227.152,
 'Moldova': 4013.178,
 "Cote D'Ivoire": 2774.301}
In [ ]:
for k,v in combi.items():
    data_pop.loc[(data_pop['Country'] == k) & (data_pop['PopTotal'].isnull()), 'PopTotal'] = v
In [ ]:
nan_sum = data_pop.isna().sum()           
nan_sum[nan_sum > 0].sort_values(ascending=False)
Out[ ]:
Region              56163
Use                  3899
Borrower genders     3888
dtype: int64
In [ ]:
data_pop.loc[data_pop['Country'] == 'Bolivia']
Out[ ]:
Funded Requested Difference amount Covered amount Activity Sector Use Country code Country Region Currency Term in months Lenders Borrower genders Repayment interval Credit types PopTotal
297 600.0 600.0 0.000000 100.000000 Retail Retail to buy toys. BO Bolivia La Paz BOB 14.0 24 female monthly bronze 11992.647
333 875.0 875.0 0.000000 100.000000 Construction Supplies Construction to buy construction materials to finish a job. BO Bolivia Cochabamba BOB 20.0 14 male monthly bronze 11992.647
420 2550.0 2550.0 0.000000 100.000000 Crafts Arts to buy fabric BO Bolivia La Paz / El Alto BOB 7.0 84 female, female, female, female, female, female... monthly bronze 11992.647
500 4900.0 4900.0 0.000000 100.000000 Clothing Sales Clothing to buy clothes. BO Bolivia La Paz / El Alto BOB 9.0 131 female, female, female, female, female, female... monthly bronze 11992.647
575 5150.0 5150.0 0.000000 100.000000 General Store Retail to buy supplies to sell BO Bolivia La Paz / El Alto BOB 9.0 138 female, female, female, female, female, female... monthly bronze 11992.647
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
646762 50.0 4925.0 98.984772 1.015228 Auto Repair Services to buy tools for his services. BO Bolivia Monteagudo BOB 62.0 2 male monthly bronze 11992.647
646815 0.0 725.0 100.000000 0.000000 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 female monthly bronze 11992.647
646816 50.0 875.0 94.285714 5.714286 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 2 female, female monthly bronze 11992.647
646819 50.0 725.0 93.103448 6.896552 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 2 female monthly bronze 11992.647
646831 0.0 875.0 100.000000 0.000000 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 female, female monthly bronze 11992.647

8799 rows × 17 columns

In [ ]:
data_pop.rename(mapper={'PopTotal':'Population'}, axis = 1, inplace = True)
data_pop.columns
Out[ ]:
Index(['Funded', 'Requested', 'Difference amount', 'Covered amount',
       'Activity', 'Sector', 'Use', 'Country code', 'Country', 'Region',
       'Currency', 'Term in months', 'Lenders', 'Borrower genders',
       'Repayment interval', 'Credit types', 'Population'],
      dtype='object')

'Funded allocation'¶

In [ ]:
data_pop.insert(4, 'Funded allocation',( data_pop['Funded'] / data_pop['Population'])*100)
data_pop
Out[ ]:
Funded Requested Difference amount Covered amount Funded allocation Activity Sector Use Country code Country Region Currency Term in months Lenders Borrower genders Repayment interval Credit types Population
0 300.0 300.0 0.0 100.0 0.130725 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 female irregular bronze 229488.996
1 575.0 575.0 0.0 100.0 0.250557 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 female, female irregular bronze 229488.996
2 150.0 150.0 0.0 100.0 0.010664 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 female bullet bronze 1406631.781
3 200.0 200.0 0.0 100.0 0.087150 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 female irregular bronze 229488.996
4 400.0 400.0 0.0 100.0 0.174300 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 female monthly bronze 229488.996
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
646828 0.0 25.0 100.0 0.0 0.000000 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 female monthly bronze 7305.842
646829 0.0 25.0 100.0 0.0 0.000000 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 female monthly bronze 229488.996
646830 0.0 125.0 100.0 0.0 0.000000 Livestock Agriculture Pretend the flagged issue was addressed by KC. MX Mexico Iztacalco MXN 13.0 0 female, female monthly bronze 131562.775
646831 0.0 875.0 100.0 0.0 0.000000 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 female, female monthly bronze 11992.647
646832 0.0 250.0 100.0 0.0 0.000000 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13.0 0 female monthly bronze 32395.454

646833 rows × 18 columns

'Borrowers'¶

In [ ]:
# so as to be able to check whether the command worked, a copy of the data set is created
data_bor = data_pop.copy()
In [ ]:
def count_borrowers(x):
    try:
        if isinstance(float(x), float):
            return np.nan
    except ValueError:
        return len(x.replace(' ', '').split(','))
In [ ]:
data_bor.insert(18, 'Borrowers', data_bor['Borrower genders'].apply(lambda x: count_borrowers(x)))
data_bor
Out[ ]:
Funded Requested Difference amount Covered amount Funded allocation Activity Sector Use Country code Country Region Currency Term in months Lenders Borrower genders Repayment interval Credit types Population Borrowers
0 300.0 300.0 0.0 100.0 0.130725 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 female irregular bronze 229488.996 1.0
1 575.0 575.0 0.0 100.0 0.250557 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 female, female irregular bronze 229488.996 2.0
2 150.0 150.0 0.0 100.0 0.010664 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 female bullet bronze 1406631.781 1.0
3 200.0 200.0 0.0 100.0 0.087150 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 female irregular bronze 229488.996 1.0
4 400.0 400.0 0.0 100.0 0.174300 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 female monthly bronze 229488.996 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
646828 0.0 25.0 100.0 0.0 0.000000 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 female monthly bronze 7305.842 1.0
646829 0.0 25.0 100.0 0.0 0.000000 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 female monthly bronze 229488.996 1.0
646830 0.0 125.0 100.0 0.0 0.000000 Livestock Agriculture Pretend the flagged issue was addressed by KC. MX Mexico Iztacalco MXN 13.0 0 female, female monthly bronze 131562.775 2.0
646831 0.0 875.0 100.0 0.0 0.000000 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 female, female monthly bronze 11992.647 2.0
646832 0.0 250.0 100.0 0.0 0.000000 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13.0 0 female monthly bronze 32395.454 1.0

646833 rows × 19 columns

Outliers¶

In [ ]:
# sns.pairplot(data, corner=True)
  • Interpretation:

    • The actual data set is a split out of a bigger record, so the data points that would fall in the same category as our outliers could be part of a different set. A helpful indication would be the answer whether the data sets split was random or if the registeration/collection of the data is time correlated.

    • The analysis of the platform activities is conducted to extract the specifics of the platform for promotion and attraction of new participants, but not to investigate the platfom in relation to other competitors. Therefore, these outliers could also be significant attributes of our platform.

    • Considering these cases, for now the outliers would be part of this mini set and might be reconsidered in the analysis part according to the required situation.

Memory reduction¶

  • To Do:
    • check for data types
    • check for memory usage
    • object data types convert into categroy data types
    • change the float data type of the columns 'Funded', 'Requested', and 'Term in months' into integer type
    • ? downcast the column of float data type
    • determine the percentage of total memory reduction
    • check for correctness
    • save the data set as pickle data type
In [ ]:
# report the data type of each column
data_bor.dtypes
Out[ ]:
Funded                 float64
Requested              float64
Difference amount      float64
Covered amount         float64
Funded allocation      float64
Activity                object
Sector                  object
Use                     object
Country code            object
Country                 object
Region                  object
Currency                object
Term in months         float64
Lenders                  int64
Borrower genders        object
Repayment interval      object
Credit types          category
Population             float64
Borrowers              float64
dtype: object
In [ ]:
# display the memory usage by each column
data_bor.memory_usage(deep=True)
Out[ ]:
Index                  5174664
Funded                 5174664
Requested              5174664
Difference amount      5174664
Covered amount         5174664
Funded allocation      5174664
Activity              44777911
Sector                41969385
Use                   75994909
Country code          38163147
Country               42240273
Region                44415501
Currency              38809980
Term in months         5174664
Lenders                5174664
Borrower genders      45391955
Repayment interval    41831080
Credit types            647257
Population             5174664
Borrowers              5174664
dtype: int64
In [ ]:
# create a copy
df_reduct = data_bor.copy()
In [ ]:
# convert object data types into category data types
for col in ['Activity', 'Sector', 'Country code', 'Country', 'Use', 'Currency', 'Region', 'Repayment interval', 'Borrower genders']:
    df_reduct[col] = df_reduct[col].astype('category')
In [ ]:
# convert float daty types into integer data types
df_reduct[['Funded', 'Requested','Term in months']] = df_reduct[['Funded', 'Requested','Term in months']].apply(pd.to_numeric, downcast='integer')
In [ ]:
# downcast the integer data type
df_reduct[['Difference amount', 'Covered amount', 'Population', 'Borrowers']] = df_reduct[['Difference amount', 'Covered amount', 'Population', 'Borrowers']].apply(pd.to_numeric, downcast='float')
In [ ]:
df_reduct[['Lenders']] = df_reduct[['Lenders']].apply(pd.to_numeric, downcast='integer')
In [ ]:
# check
df_reduct.dtypes
Out[ ]:
Funded                   int32
Requested                int32
Difference amount      float32
Covered amount         float32
Funded allocation      float64
Activity              category
Sector                category
Use                   category
Country code          category
Country               category
Region                category
Currency              category
Term in months           int16
Lenders                  int16
Borrower genders      category
Repayment interval    category
Credit types          category
Population             float32
Borrowers              float32
dtype: object
In [ ]:
# total memory storage space reduction
reduction = (data_bor.memory_usage(deep=True).sum()-df_reduct.memory_usage(deep=True).sum())/data_bor.memory_usage(deep=True).sum()
print(f"{reduction:0f}")
0.758987

Save the data set¶

In [ ]:
# save the data set as pickle
df_reduct.to_pickle('output/crowdfunding_datapreprocessed.pkl')
In [ ]:
# check
df = pd.read_pickle('output/crowdfunding_datapreprocessed.pkl')
df
Out[ ]:
Funded Requested Difference amount Covered amount Funded allocation Activity Sector Use Country code Country Region Currency Term in months Lenders Borrower genders Repayment interval Credit types Population Borrowers
0 300 300 0.0 100.0 0.130725 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12 12 female irregular bronze 2.294890e+05 1.0
1 575 575 0.0 100.0 0.250557 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11 14 female, female irregular bronze 2.294890e+05 2.0
2 150 150 0.0 100.0 0.010664 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43 6 female bullet bronze 1.406632e+06 1.0
3 200 200 0.0 100.0 0.087150 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11 8 female irregular bronze 2.294890e+05 1.0
4 400 400 0.0 100.0 0.174300 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14 16 female monthly bronze 2.294890e+05 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
646828 0 25 100.0 0.0 0.000000 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13 0 female monthly bronze 7.305842e+03 1.0
646829 0 25 100.0 0.0 0.000000 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13 0 female monthly bronze 2.294890e+05 1.0
646830 0 125 100.0 0.0 0.000000 Livestock Agriculture Pretend the flagged issue was addressed by KC. MX Mexico Iztacalco MXN 13 0 female, female monthly bronze 1.315628e+05 2.0
646831 0 875 100.0 0.0 0.000000 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13 0 female, female monthly bronze 1.199265e+04 2.0
646832 0 250 100.0 0.0 0.000000 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13 0 female monthly bronze 3.239545e+04 1.0

646833 rows × 19 columns